Statistics and Machine Learning

Data Science Nigeria AI Bootcamp

Kris Sankaran
November 13, 2020

Session Outline

Learning Process

What is the distinction?

Statistics                      Machine Learning
Origins in mathematics          Origins in engineering
Connections to philosophy       Connections to neuroscience
Emphasis on scientific insight  Emphasis on automated systems

Part 1: Bayes’ Rule and Variational Auto-Encoders

Inference & Prediction

Classical Bayes

Bayes’ rule \[\begin{align*} p\left(\theta \vert x\right) &= \frac{p\left(x \vert \theta\right)p\left(\theta\right)}{p\left(x\right)} \\ &\propto p\left(x\vert\theta\right)p\left(\theta\right) \end{align*}\]

Example: Beta-Binomial

History: Laplace’s Sunrise Problem

In 1814, Pierre-Simon Laplace asked,

What is the probability that the sun will rise tomorrow?

which is an early appearance of the Beta-Binomial model.

Derivation: Prior

The Beta prior density is defined over \(\left[0, 1\right]\) and has the form,

\[ p\left(\theta; a_0, b_0\right) \propto \theta^{a_0 - 1}\left(1 - \theta\right)^{b_0 - 1} \]

The fact that it flexibly assigns probability across \(\left[0, 1\right]\) makes it a good candidate for a prior over probabilities.

Q: How does the shape of the density change when you vary \(a_0\) and \(b_0\)? Hint: This demo
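One way to build intuition is a quick numeric sketch: evaluate the unnormalized density on a grid for a few \(\left(a_0, b_0\right)\) pairs and read off where it peaks (the values chosen here are illustrative, not from the demo).

```python
import numpy as np

# Evaluate the unnormalized Beta density on a grid for several (a0, b0)
theta = np.linspace(0.001, 0.999, 999)

modes = {}
for a0, b0 in [(2, 2), (8, 2), (2, 8)]:
    dens = theta ** (a0 - 1) * (1 - theta) ** (b0 - 1)
    modes[(a0, b0)] = theta[np.argmax(dens)]

print(modes)  # symmetric case peaks at 0.5; mass shifts right as a0 grows, left as b0 grows
```

Larger \(a_0\) pulls the density toward 1, larger \(b_0\) toward 0, and equal values keep it symmetric around 0.5.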

Derivation: Likelihood

The binomial distribution is a model for the number of heads \(x \in \left\{0, 1, \dots, n\right\}\) seen after \(n\) independent flips of a coin with heads probability \(\theta\). It has the form,

\[ p\left(x \vert \theta\right) = {n \choose x} \theta^{x}\left(1 - \theta\right)^{n - x} \] If \(\theta = 0.1\), how would this figure change? What about \(\theta = 0.9\)? (Hint)
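As a quick check on that question, we can tabulate the pmf for a few values of \(\theta\) (a small sketch using `math.comb`; \(n = 10\) is an arbitrary choice):

```python
import numpy as np
from math import comb

n = 10
mode = {}
for theta in [0.1, 0.5, 0.9]:
    # Binomial pmf over all possible head counts x = 0, ..., n
    pmf = np.array([comb(n, x) * theta**x * (1 - theta) ** (n - x) for x in range(n + 1)])
    mode[theta] = int(np.argmax(pmf))  # most probable number of heads

print(mode)
```

The pmf concentrates near \(n\theta\): small \(\theta\) pushes the mass toward 0 heads, large \(\theta\) toward \(n\).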

Derivation: Posterior

Using Bayes’ rule, we can compute the posterior, \[\begin{align*} p\left(\theta \vert x\right) &\propto p\left(x \vert \theta\right)p\left(\theta\right) \\ &\propto {n \choose x}\theta^{x}\left(1 - \theta\right)^{n - x}\theta^{a_0 - 1}\left(1 - \theta\right)^{b_0 - 1} \\ &\propto \theta^{a_0 + x - 1}\left(1 - \theta\right)^{b_0 + n - x - 1} \end{align*}\] which is still a Beta distribution, but with new parameters \(a_0 + x\) and \(b_0 + n - x\). How does its shape change depending on whether \(x\) is very large or very small?

Simulation

After seeing more and more coin flips, the posterior concentrates around the underlying probability of about 0.2.
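The simulation can be sketched in a few lines using the conjugate update above (a minimal sketch, assuming a uniform Beta(1, 1) prior and a true probability of 0.2):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.2
a, b = 1.0, 1.0  # Beta(1, 1) prior (assumed uniform)

for batch in [10, 100, 1000]:
    x = (rng.random(batch) < theta_true).sum()  # heads in this batch of flips
    a, b = a + x, b + batch - x                 # conjugate update: Beta(a0 + x, b0 + n - x)
    post_mean = a / (a + b)
    post_sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"n={int(a + b - 2)}: posterior mean={post_mean:.3f}, sd={post_sd:.3f}")
```

Each batch of flips shrinks the posterior standard deviation, and the posterior mean settles near the true value 0.2.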

Non-Conjugate Models

The Variational Idea

The hope is to find \(q^{\ast} \in \mathcal{Q}\) that’s close to the true posterior.

Variational Auto-Encoder (VAE) Model

Motivating Example: MNIST

We’ll want to arrive at a representation like this. The latent space (center) clearly distinguishes between different MNIST classes. From a given point in the latent space, we can generate many different images.

VAE Model (Decoding)

Intuition: Decoder

Here \(z_i\) is 2D, and determines the shape of the observed digit. \(x_i\) is a vector whose elements correspond to pixels in the image. Different \(z_i\)’s are associated with different per-pixel normal distributions, parameterized by \(\mu_{\theta}\) and \(\sigma_{\theta}^{2}\).

Intuition: Decoder

Since the decoder defines a distribution \(p_{\theta}\left(x_i \vert z_i \right)\) for each fixed \(z_i\), we can sample many reconstructions. Since the blue curves (defined by \(\mu_{\theta}\) and \(\sigma_\theta\)) stay the same, the basic shape of the digit doesn’t change. However, individual pixel values (the green bars) change.

Intuition: Decoder

Typically, the \(z_i\)’s will be higher-dimensional (e.g., \(d = 20\) in the original paper). The same intuition carries over, except the \(z_i\) have values along \(d\) coordinates.

VAE Model (Encoder)

Intuition: Encoder

Given a particular image, we update the prior to a posterior in the latent space. This is an example of the prior.

Intuition: Encoder

The center of the posterior is given by \(\mu_{\varphi}\left(x_i\right)\) and the widths about each axis are given by \(\sigma_{\varphi}^2\left(x_i\right)\).

Intuition: Encoder

Typically the latent space will be more than two dimensional. Here’s what the prior would look like for larger \(d\).

Implementation: Encoder

import torch
import torch.nn as nn
import torch.nn.functional as F

# VAE model
class VAE(nn.Module):
    def __init__(self, image_size=784, h_dim=400, z_dim=20):
        super(VAE, self).__init__()
        self.fc1 = nn.Linear(image_size, h_dim)
        self.fc2 = nn.Linear(h_dim, z_dim)
        self.fc3 = nn.Linear(h_dim, z_dim)
        self.fc4 = nn.Linear(z_dim, h_dim)
        self.fc5 = nn.Linear(h_dim, image_size)
        
    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc2(h), self.fc3(h) # (mu, log_var)
    ...

Example Encodings

Here we are wandering across some path in the latent \(z\) space and observing the associated images \(\mu_{\theta}\left(z\right)\).

Optimization

Why the ELBO?

\[\begin{align*} \log p_{\theta}\left(x\right) &= \mathbb{E}_{q}\left[\log p_{\theta}\left(x\right)\right] \\ &= \mathbb{E}_{q}\left[\log \frac{p_{\theta}\left(x, z\right)}{p_{\theta}\left(z \vert x\right)}\right] \\ &= \mathbb{E}_{q}\left[\log \frac{p_{\theta}\left(x, z\right)}{q\left(z \vert x\right)} \frac{q\left(z \vert x\right)}{p_{\theta}\left(z \vert x\right)}\right] \\ &= \mathbb{E}_{q}\left[\log p_{\theta}\left(x \vert z\right)\right] - D_{KL}\left(q\left(z \vert x\right) \vert \vert p\left(z\right)\right) + D_{KL}\left(q\left(z \vert x\right) \vert \vert p_{\theta}\left(z \vert x\right)\right) \end{align*}\]

Studying the bound

\[\begin{align*} \log p_{\theta}\left(x\right) &= \color{#6c9fb3}{\mathbb{E}_{q}\left[\log p_{\theta}\left(x \vert z\right)\right]} - \color{#de784d}{D_{KL}\left(q\left(z \vert x\right) \vert \vert p\left(z\right)\right)} + \color{#445da5}{D_{KL}\left(q\left(z \vert x\right) \vert \vert p_{\theta}\left(z \vert x\right)\right)} \\ &\geq \color{#6c9fb3}{\mathbb{E}_{q}\left[\log p_{\theta}\left(x \vert z\right)\right]} - \color{#de784d}{D_{KL}\left(q\left(z \vert x\right) \vert \vert p\left(z\right)\right)} \end{align*}\]

Optimization

Q: Why not?

Reparameterization Trick

Optimizing \(\varphi\) is hard, because we can’t estimate the gradient \[\begin{align*} \nabla_{\varphi} \mathbb{E}_{q_{\varphi}}\left[\log p_{\theta}\left(x \vert z\right)\right] \end{align*}\] by simply sampling from \(q_{\varphi}\).

Taking the derivative under the integral, \[\begin{align*} \nabla_{\varphi} \mathbb{E}_{q_{\varphi}}\left[\log p_{\theta}\left(x \vert z\right)\right] &= \int \log p_{\theta}\left(x \vert z\right) \nabla_{\varphi}q_{\varphi}\left(z \vert x\right) dz \end{align*}\]

The right-hand side is no longer an expectation over \(q_{\varphi}\), so we can’t approximate it using samples from \(q_{\varphi}\).

Reparameterization Trick

Notice however that \[\begin{align*} z \vert x &\sim \mathcal{N}\left(z \vert \mu_{\varphi}\left(x\right), \sigma^2_{\varphi}\left(x\right)\right) \end{align*}\] is equivalent to \[\begin{align*} \epsilon &\sim \mathcal{N}\left(0, I\right) \\ z \vert x, \epsilon &\equiv \mu_{\varphi}\left(x\right) + \sigma_{\varphi}\left(x\right) \odot \epsilon \end{align*}\] which decouples the source of randomness from the parameters \(\varphi\).

From random to deterministic

Instead of working with a different Gaussian density for each \(\mu_{\varphi}\) and \(\sigma_{\varphi}^2\), we work with a single standard normal and pass its samples through different affine transformations.
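A quick numerical check of this equivalence (a sketch with numpy standing in for torch; the values of \(\mu_{\varphi}\left(x\right)\) and \(\sigma_{\varphi}\left(x\right)\) are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5                # hypothetical outputs mu_phi(x), sigma_phi(x)

eps = rng.standard_normal(100_000)  # the only source of randomness: N(0, 1)
z = mu + sigma * eps                # deterministic, differentiable transform of eps

print(z.mean(), z.std())            # close to mu and sigma, as for N(mu, sigma^2)
```

Because the randomness lives entirely in \(\epsilon\), gradients with respect to \(\mu_{\varphi}\) and \(\sigma_{\varphi}\) pass straight through the transform.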

Code Example

# VAE model
class VAE(nn.Module):
    def __init__(self, image_size=784, h_dim=400, z_dim=20):
        super(VAE, self).__init__()
        ...
       
    def reparameterize(self, mu, log_var):
        std = torch.exp(log_var/2)
        eps = torch.randn_like(std)
        return mu + eps * std
        
    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        x_reconst = self.decode(z)
        return x_reconst, mu, log_var

Optimization after Reparameterization

Code Example

for i, (x, _) in enumerate(data_loader):
    # Flatten images to match the 784-dimensional input layer
    x = x.view(-1, 784)

    # Forward pass
    x_reconst, mu, log_var = model(x)

    # Compute reconstruction loss and KL divergence
    reconst_loss = F.binary_cross_entropy(x_reconst, x, reduction="sum")
    kl_div = - 0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    
    # Backprop and optimize
    loss = reconst_loss + kl_div
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
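The `kl_div` line above is the closed-form KL divergence between the diagonal Gaussian \(q\left(z \vert x\right)\) and the standard normal prior. A quick sanity check of that formula against a Monte Carlo estimate (a numpy sketch in one latent dimension, with made-up values of `mu` and `log_var`):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, log_var = 0.5, np.log(0.25)  # hypothetical q(z|x) = N(0.5, 0.25)
sigma = np.exp(log_var / 2)

# Closed form used in the loss: -0.5 * (1 + log_var - mu^2 - exp(log_var))
kl_closed = -0.5 * (1 + log_var - mu**2 - np.exp(log_var))

# Monte Carlo estimate of E_q[log q(z) - log p(z)]
z = mu + sigma * rng.standard_normal(200_000)
log_q = -0.5 * (np.log(2 * np.pi) + log_var + (z - mu) ** 2 / sigma**2)
log_p = -0.5 * (np.log(2 * np.pi) + z**2)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates agree closely
```

The agreement confirms that the single-line `kl_div` term really is the \(D_{KL}\left(q\left(z \vert x\right) \vert \vert p\left(z\right)\right)\) penalty from the ELBO.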

Part 2: Representational Analysis

Tools for Introspection

Representational Analysis

Plan

Toy Example

Toy Example

Training

This is what the fitted function looks like as we train it. A colab notebook with all the code is here.

Representations

We can plot the activations for different layers (and at different stages of learning). Each row gives the activations for a particular neuron. Each column gives a value of \(x_i\). It looks like groups of neurons activate when the data lie in particular regions of the input space.
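The kind of activation matrix described above can be sketched directly (a toy sketch with numpy and random weights, not the notebook's trained network; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)                        # grid of inputs x_i (one per column)
W, b = rng.normal(size=(10, 1)), rng.normal(size=(10, 1))

# ReLU activations of a hidden layer: row = neuron, column = input x_i
H = np.maximum(0, W @ x[None, :] + b)

print(H.shape)  # (10, 50), ready to display as a heatmap
```

Displaying `H` as a heatmap reproduces the row-per-neuron, column-per-input picture: each neuron switches on only over the region of the input space where its affine pre-activation is positive.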

Representations

This is what the second layer looks like at convergence.

Representations

Here is the final layer, but before the model has fully converged.

Formalization

Canonical Correlation Analysis (CCA)

CCA Geometry

CCA Geometry

CCA Geometry

Canonical Correlation Analysis (CCA)

Interpretation

Back to Deep Learning

Why the Singular Value Decomposition (SVD)?

SVCCA Recipe

For a concise summary of the similarity between \(X\) and \(Y\), use the average \(\rho_{k}\) across several directions.
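A compact sketch of that recipe (assuming activation matrices with examples as rows and neurons as columns; the function name and the choice `k=20` are illustrative):

```python
import numpy as np

def svcca(X, Y, k=20):
    """Average canonical correlation between the top-k SVD subspaces of X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Step 1: SVD keeps the top-k directions of each representation
    Ux, _, _ = np.linalg.svd(X, full_matrices=False)
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)

    # Step 2: the columns of Ux, Uy are orthonormal, so the singular values of
    # their cross-product are the canonical correlations rho_1 >= ... >= rho_k
    rho = np.linalg.svd(Ux[:, :k].T @ Uy[:, :k], compute_uv=False)
    return rho.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))  # 500 examples, 30 neurons (hypothetical shapes)
print(round(svcca(X, X), 3))    # identical representations give 1.0
```

Identical representations score 1, while unrelated random activations score much lower, so the average \(\rho_{k}\) tracks representational similarity across layers or training epochs.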

Examples: Layer 2 across epochs

Examples: Layer 4 across epochs

Examples: Layer 2 vs. 4

Real-world Application

One compelling application of this method is the study of representations learned by deep learning systems in medical settings. A representational analysis makes it clear that transfer learning is most useful for the lower layers of the network, while higher layers change substantially during fine-tuning.

Conclusion

Review

Other topics in Statistics and ML

Perspectives on Statistics and ML

Last Word